Abstract


We propose DeepMemory, a novel deep architecture for sequence-to-sequence learning,
which performs the task through a series of nonlinear transformations from the representation
of the input sequence (e.g., a Chinese sentence) to the final output sequence (e.g., translation
to English). Inspired by the recently proposed Neural Turing Machine (Graves et al., 2014),
we store the intermediate representations in stacked layers of memories, and use read-write
operations on the memories to realize the nonlinear transformations between the representations.
The types of transformations are designed in advance but the parameters are learned
from data. Through layer-by-layer transformations, DeepMemory can model complicated
relations between sequences necessary for applications such as machine translation between
distant languages. The architecture can be trained with normal back-propagation on
sequence-to-sequence data, and the learning can be easily scaled up to a large corpus. DeepMemory
is broad enough to subsume the state-of-the-art neural translation model in (Bahdanau et al.,
2015) as its special case, while significantly improving upon the model with its deeper
architecture. Remarkably, DeepMemory, being purely neural network-based, can achieve
performance comparable to the traditional phrase-based machine translation system Moses
with a small vocabulary and a modest parameter size.


1 Introduction 


Sequence-to-sequence learning is a fundamental problem in natural language processing, with many important
applications such as machine translation (Bahdanau et al., 2015; Sutskever et al., 2014), part-of-speech tagging
(Collobert et al., 2011; Vinyals et al., 2014) and dependency parsing (Chen & Manning, 2014). Recently, there
has been significant progress in the development of technologies for the task using purely neural network-based
models. Without loss of generality, we consider machine translation in this paper. Previous efforts on neural
machine translation generally fall into two categories:

• Encoder-Decoder: As illustrated in the left panel of Figure 1, models of this type first summarize the
source sentence into a fixed-length vector with the encoder, typically implemented with a recurrent
neural network (RNN) or a convolutional neural network (CNN), and then unfold the vector into the target
sentence with the decoder, typically implemented with an RNN (Auli et al., 2013; Kalchbrenner &
Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014);

• Attention-Model: With RNNsearch (Bahdanau et al., 2015; Luong et al., 2015) as representative, it
represents the source sentence as a sequence of vectors produced by an RNN (e.g., a bi-directional RNN
(Schuster & Paliwal, 1997)), and then simultaneously conducts dynamic alignment with a gating neural
network and generation of the target sentence with another RNN, as illustrated in the right panel of Figure 1.


Empirical comparison between the approaches indicates that the attention model is more efficient than the
encoder-decoder approach: it can achieve comparable results with far fewer parameters and training
instances (Jean et al., 2015). This superiority in efficiency comes mainly from the mechanism of dynamic
alignment, which avoids the need to represent the entire source sentence with a fixed-length vector (Sutskever et al., 2014).

*The work was done when the first author worked as an intern at Noah’s Ark Lab, Huawei Technologies.
[Figure 1 panels: Encoder-Decoder (left); Attention model (right)]

Figure 1: Two types of neural machine translators. Note that the pictorial illustrations may deviate from individual
models, e.g., Sutskever et al. (2014), in modeling details.


1.1 Deep Memory-based Architecture 


Both encoder-decoders and attention models can be reformalized in the language of Neural Turing Machines
(NTM) (Graves et al., 2014), by treating the different forms of representations as content in memories and the
operations on them as basic neural net-controlled read-write actions, as illustrated in the left panel of Figure 2.
This is clear after realizing that the attention mechanism (Bahdanau et al., 2015) is essentially a special case of
reading (in particular with content-based addressing) in NTM on the memory that contains the representation
of the source sentence. More importantly, under this new view, the whole process becomes transforming the source
sentence and putting it into memory (a vector or an array of vectors), and reading from this memory to further
transform it into the target sentence. This architecture is intrinsically shallow in terms of the transformations
on the sequence as an object, with essentially one hidden “layer”, as illustrated in the left panel of Figure 2.
Note that although an RNN (as encoder/decoder or equivalently as controller in NTM) can be infinitely deep,
this depth is merely for dealing with the temporal structure within the sequence. On the other hand, many
sequence-to-sequence tasks, e.g., translation, are intrinsically complex and call for a more complex and powerful
transformation mechanism than that in encoder-decoder and attention models.



Figure 2: Space holder for NTM view of things 


For this reason, we propose a novel deep memory-based architecture, named DeepMemory, for
sequence-to-sequence learning. As shown in the right panel of Figure 2, DeepMemory carries out the task through a
series of nonlinear transformations from the input sequence, to different levels of intermediate memory-based
representations, and eventually to the final output sequence. DeepMemory is essentially a customized and
deep version of NTM with multiple stages of operations controlled by a program, where the choices of “layers”
and types of read/write operations between layers are tailored for a particular task.


Through layer-by-layer stacking of transformations on memory-based representations, DeepMemory
generalizes the notion of inter-layer nonlinear mapping in neural networks, and therefore introduces a powerful new
deep architecture for sequence-to-sequence learning. The aim of DeepMemory is to learn a representation
of the sequence better suited to the task (e.g., machine translation) through layer-by-layer transformations. Just as in
deep neural networks (DNN), we expect that stacking relatively simple transformations can greatly enhance the
expressive power and the efficiency of DeepMemory, especially in handling translation between languages
of vastly different nature (e.g., Chinese and English) and sentences with complicated structures. DeepMemory
naturally subsumes current neural machine translation models (Bahdanau et al., 2015; Sutskever et al.,
2014) as special cases, but more importantly it accommodates many deeper alternatives with more modeling
power, which are empirically superior to the current shallow architectures on machine translation tasks.


Although DeepMemory is initially proposed for machine translation, it can be adapted for other tasks that
require substantial transformations of sequences, including paraphrasing, reasoning (Peng et al., 2015), and
semantic parsing (Yin et al., 2015). Also, in defining the layer-by-layer transformations, we can go beyond
the read-write operations proposed in (Graves et al., 2014) and design differentiable operations for the specific
structures of the task (e.g., in (Yin et al., 2015)).
RoadMap We will first discuss in Section 2 the read-write operations as a new form of nonlinear
transformation, as the building block of DeepMemory. Then in Section 3 we stack the transformations together to
get the full DeepMemory architecture, and discuss several architectural variations of it. In Section 4 we report
our empirical study of DeepMemory on a Chinese-English translation task.

2 Read-Write as a Nonlinear Transformation 

We start by discussing read-write operations between two pieces of memory as a generalized form of nonlinear
transformation. As illustrated in Figure 3 (left panel), this transformation is between two non-overlapping
memories, namely R-memory and W-memory, with W-memory being initially blank. A controller operates the
read-heads to get the values from R-memory (“reading”), which are then sent to the write-head for modifying
the values at specific locations in W-memory (“writing”). After those operations are completed, the content
in R-memory is considered transformed and written to W-memory. These operations therefore define a
transformation from one representation (in R-memory) to another (in W-memory), which is pictorially denoted in the
right panel of Figure 3.


[Figure 3 panels: a controller with read-heads on R-memory and a write-head on W-memory (left); compact notation for the transformation from R-memory to W-memory (right)]

Figure 3: Read-write as a nonlinear transformation.

These basic components are more formally defined below, following those in a generic NTM (Graves et al.,
2014), with however important modifications for the nesting architecture, implementation efficiency and
description simplicity.


Memory: A memory is generally defined as a matrix with potentially infinite size, while here we limit
ourselves to a pre-determined (pre-claimed) $N \times d$ matrix, with $N$ memory locations and $d$ values in each location.
In our implementation of DeepMemory, $N$ is always instance-dependent and is pre-determined by the
algorithm.¹ Memories of different layers generally have different $d$. Now suppose for one particular instance (index
omitted for notational simplicity), the system reads from the R-memory ($M^R$, with $N_R$ units) and writes to
W-memory (denoted $M^W$, with $N_W$ units):

R-memory: $M^R = \{x^R_1, x^R_2, \cdots, x^R_{N_R}\}$, W-memory: $M^W = \{x^W_1, x^W_2, \cdots, x^W_{N_W}\}$,

with $x^R_n \in \mathbb{R}^{d_R}$ and $x^W_n \in \mathbb{R}^{d_W}$.


Read/write heads: A read-head gets the values from the corresponding memory, following the instructions of
the controller; the values read in turn influence the controller by feeding its state machine. DeepMemory allows multiple
read-heads for one controller, with potentially different addressing strategies (see Section 2.1 for more details).
A write-head simply takes the instruction from the controller and modifies the values at specific locations.


Controller: The core of the controller is a state machine, implemented as a Long Short-Term Memory RNN
(LSTM) (Hochreiter & Schmidhuber, 1997), with the state at time $t$ denoted as $s_t$ (as illustrated in Figure 3). With
$s_t$, the controller determines the reading and writing at time $t$, while the return of reading in turn takes part in
updating the state. For simplicity, only one reading and one writing are allowed at each time step, but more than one
read-head is allowed. The main equations for the controller are then

Read vector: $r_t = F_r(M^R, s_t; \Theta_r)$

Write vector: $v_t = F_w(s_t; \Theta_w)$

State update: $s_{t+1} = F_d(s_t, r_t; \Theta_d)$

¹It is possible to let the controller learn to determine the length of the memory, but that does not yield better performance
on our tasks and is therefore omitted here.
where $F_d(\cdot)$, $F_r(\cdot)$ and $F_w(\cdot)$ are respectively the operators for dynamics, reading and writing, parameterized
by $\Theta_d$, $\Theta_r$, and $\Theta_w$. In DeepMemory, 1) it is only allowed to read from the memory of a lower layer and write to
the memory of a higher layer, and 2) reading from a memory can only be performed after the writing to it has finished.

The above read-write operations transform the representation in R-memory to the new representation in
W-memory, while the design choice of read-write specifies the inner structure of the units in W-memory.
The transformation is therefore jointly specified by the read-write strategies (e.g., the addressing described
in Section 2.1) and the parameters learned in a supervised fashion (described later in Section 3.2). Memory
with the designed inner structure, in this particular case a vector array with instance-specific length, offers more
flexibility than a fixed-length vector in representing sequences. This representational flexibility is particularly
advantageous when combined with a proper reading-writing strategy in defining nonlinear transformations for
sequences, which will serve as the building block of the deep architecture.
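As a concrete illustration of one read-write transformation, the following minimal sketch maps an R-memory to a W-memory of the same length, reading and writing one location per step. It rests on several assumptions that are not part of the model description: a plain tanh recurrence stands in for the LSTM state machine, a single tanh layer stands in for the write operator, the state is updated before each write so that location t reflects the unit just read, and all function names are hypothetical.

```python
import math
import random

def dense_tanh(weights, vector):
    """One fully connected layer with tanh activation (bias omitted)."""
    return [math.tanh(sum(w * v for w, v in zip(row, vector))) for row in weights]

def random_matrix(rows, cols, rng):
    """Small random weights that stand in for learned parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def read_write_sequential(r_memory, theta_d, theta_w, state_dim):
    """Map R-memory (N x d_R) to W-memory (N x d_W), one location per step.

    Assumptions: tanh recurrence instead of an LSTM, and the state is
    updated with the current read before the write is issued.
    """
    state = [0.0] * state_dim
    w_memory = []
    for unit in r_memory:
        read = unit                                    # read the t-th unit
        state = dense_tanh(theta_d, state + read)      # state update from (state, read)
        w_memory.append(dense_tanh(theta_w, state))    # write the t-th unit
    return w_memory

rng = random.Random(0)
d_r, d_w, state_dim, length = 4, 3, 5, 6
r_memory = [[rng.uniform(-1, 1) for _ in range(d_r)] for _ in range(length)]
theta_d = random_matrix(state_dim, state_dim + d_r, rng)
theta_w = random_matrix(d_w, state_dim, rng)
w_memory = read_write_sequential(r_memory, theta_d, theta_w, state_dim)
```

The output has as many locations as the input but a different width, which is the sense in which memories of different layers may have different d.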


2.1 Addressing 

2.1.1 Addressing for Reading 

Location-based Addressing With location-based addressing (L-addressing), the reading is simply $r_t = x^R_t$.
Notice that with L-addressing, the state machine automatically runs on a clock determined by the spatial
structure of R-memory. Following this clock, the write-head operates the same number of times. One important
variant, as suggested in (Bahdanau et al., 2015; Sutskever et al., 2014), is to go through R-memory backwards
after the forward reading pass, where the controller RNN has the same structure but is parameterized differently.


Content-based Addressing With content-based addressing (C-addressing), the return at time $t$ is

$$r_t = F_r(M^R, s_t; \Theta_r) = \sum_{n=1}^{N_R} g(s_t, x^R_n; \Theta_r)\, x^R_n, \quad \text{and} \quad g(s_t, x^R_n; \Theta_r) = \frac{\tilde{g}(s_t, x^R_n; \Theta_r)}{\sum_{n'=1}^{N_R} \tilde{g}(s_t, x^R_{n'}; \Theta_r)},$$

where $\tilde{g}(s_t, x^R_n; \Theta_r)$, implemented as a DNN, gives an un-normalized “affiliation” score for unit $x^R_n$ in
R-memory. Clearly it is related to the attention mechanism introduced in (Bahdanau et al., 2015) for machine
translation and the general attention models discussed in (Gregor et al., 2015) for computer vision. Content-based
addressing offers the following two advantages in representation learning:

1. it can focus on the right segment of the representation, as demonstrated by the automatic alignment
observed in (Bahdanau et al., 2015), therefore better preserving the information in lower layers;

2. it provides a way to alter the spatial structure of the sequence representation on a large scale, for which
the re-ordering in machine translation is an intuitive example.
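The content-based read can be sketched as a normalized weighted average over all units of R-memory. In the sketch below the un-normalized score is the exponential of a dot product between the state and a unit; this scoring function is an assumption chosen only because it is positive and short to write, whereas the model described here uses a DNN for the score. The function names are hypothetical.

```python
import math

def exp_dot_score(state, unit):
    """Hypothetical un-normalized score: exp of a dot product (always positive)."""
    return math.exp(sum(s * x for s, x in zip(state, unit)))

def c_addressing_read(r_memory, state, score=exp_dot_score):
    """Return the read vector and the normalized weights over R-memory."""
    raw = [score(state, unit) for unit in r_memory]
    total = sum(raw)
    weights = [value / total for value in raw]          # weights sum to one
    dim = len(r_memory[0])
    read = [sum(w * unit[i] for w, unit in zip(weights, r_memory))
            for i in range(dim)]
    return read, weights
```

Because every location contributes to every read, the cost of a single read grows with the number of locations, in contrast to the single lookup of L-addressing.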


Hybrid Addressing: With hybrid addressing (H-addressing) for reading, we essentially use two read-heads
(which can be easily extended to more), one with L-addressing and the other with C-addressing. At each time $t$, the
controller simply concatenates the returns of the two individual read-heads as the final return:

$$r_t = \Big[x^R_t,\ \sum_{n=1}^{N_R} g(s_t, x^R_n; \Theta_r)\, x^R_n\Big].$$

It is worth noting that with H-addressing, the tempo of the state machine will be determined by the L-addressing
read-head, which therefore creates a W-memory with the same number of locations in writing. As shown later,
H-addressing can be readily extended to allow read-heads to work on different memories.
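A minimal sketch of the hybrid read follows. It concatenates the unit at the current location with a content-based weighted average, and the loop length is fixed by the location-based head. The controller states are passed in as a fixed list purely for illustration (in the model they come from the state machine), and the exp-of-dot-product score is again an assumed stand-in for the DNN score.

```python
import math

def content_read(r_memory, state):
    """Weighted average of all units, with assumed exp dot-product scores."""
    scores = [math.exp(sum(s * x for s, x in zip(state, unit))) for unit in r_memory]
    total = sum(scores)
    dim = len(r_memory[0])
    return [sum(sc / total * unit[i] for sc, unit in zip(scores, r_memory))
            for i in range(dim)]

def hybrid_reads(r_memory, states):
    """Return one concatenated read vector per location of R-memory."""
    reads = []
    for t, unit in enumerate(r_memory):          # tempo set by the location-based head
        reads.append(unit + content_read(r_memory, states[t]))
    return reads
```

Each read vector has twice the width of a memory unit, and there are exactly as many reads as locations in R-memory.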


2.1.2 Addressing for Writing 

Location-based Addressing With L-addressing, the writing is simple. At any time $t$, only the $t$-th location in
W-memory is updated: $x^W_t = v_t = F_w(s_t; \Theta_w)$, which will be kept unchanged afterwards. For both location-
and content-based addressing, $F_w(s_t; \Theta_w)$ is implemented as a DNN with weights $\Theta_w$.


Content-based Addressing In a way similar to C-addressing for reading, the units to write are determined
through a gating network $g(s_t, x^W_{n,t}; \Theta_w)$, where the values in W-memory at time $t$ are given by

$$\Delta_{n,t} = g(s_t, x^W_{n,t}; \Theta_w)\, F_w(s_t; \Theta_w), \qquad x^W_{n,t} = (1 - \alpha_t)\, x^W_{n,t-1} + \alpha_t \Delta_{n,t}, \qquad n = 1, 2, \cdots, N_W,$$

where $x^W_{n,t}$ stands for the values of the $n$-th location in W-memory at time $t$, $\alpha$ is the forgetting factor (similarly
defined as in (Graves et al., 2014)), and $g$ is the normalized weight (with the un-normalized score implemented also with
a DNN) given to the $n$-th location at time $t$.

²Note that our definition of writing is slightly different from that in (Graves et al., 2014).
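One step of content-based writing can be sketched as follows: every location receives a share of the write vector according to a normalized weight, and is then interpolated with its previous content through the forgetting factor. The sketch makes two assumptions that go beyond the text: the weights are computed from the content of W-memory before the update, and the un-normalized score is the exponential of a dot product (which requires the state and the memory units to have the same width) rather than a DNN.

```python
import math

def c_addressing_write_step(w_memory, write_vector, state, alpha):
    """Return the new W-memory after one content-addressed write."""
    scores = [math.exp(sum(s * x for s, x in zip(state, unit))) for unit in w_memory]
    total = sum(scores)
    new_memory = []
    for unit, sc in zip(w_memory, scores):
        weight = sc / total                               # normalized weight of this location
        delta = [weight * v for v in write_vector]        # share of the write vector
        new_memory.append([(1 - alpha) * x + alpha * d    # blend with previous content
                           for x, d in zip(unit, delta)])
    return new_memory
```

With a forgetting factor of zero the memory is left untouched, and with a factor of one the previous content is discarded and the locations together hold exactly one copy of the write vector.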


2.2 Types of Nonlinear Transformations 


As the most “conventional” special case, if we use L-addressing for both reading and writing, we actually get
the familiar structure in units found in RNN with stacked layers (Pascanu et al., 2014). Indeed, as illustrated in
the figure to the right of the text, this read-write strategy will invoke a relatively local dependency based on the
original spatial order in R-memory.

[Inline figure: W-memory above R-memory, with L-addressing for both reading and writing]

It is not hard to show that we can recover some deep RNN model in (Pascanu et al., 2014) after stacking layers
of read-write operations like this. This deep architecture actually partially accounts for the great performance
of the Google neural machine translation model (Sutskever et al., 2014).


The C-addressing, however, be it for reading or writing, offers a means of major reordering of the units,
while H-addressing can add to it the spatial structure of the lower-layer memory. In this paper, we consider
four types of transformations induced by combinations of the read and write addressing strategies, listed
pictorially in Figure 4. Notice that 1) we only include one combination with C-addressing for writing since it is
computationally expensive to optimize when combined with a C-addressing reading (see Section 3.2 for some
analysis), and 2) for one particular read-write strategy there is still a fair amount of implementation detail to
be specified, which is omitted due to the space limit. One can easily design different read/write strategies, for
example a particular way of H-addressing for writing.



Figure 4: Examples of read-write strategies. 


3 DeepMemory: Stacking Them Together 

As illustrated in Figure 5 (left panel), the stacking is straightforward: we can just apply one transformation on top
of another, with the W-memory of the lower layer being the R-memory of the upper layer. The entire deep architecture
of DeepMemory, with its diagram in Figure 5 (right panel), can therefore be defined accordingly. Basically, it
starts with a symbol sequence (Layer-0), then moves to the sequence of word embeddings (Layer-1), and through
layers of transformation reaches the final intermediate layer (Layer-L), which will be read by the output layer.

The operations in output layer, relying on another LSTM to generate the target sequence, are similar to a memory 
read-write, with the following two differences: 

• it predicts the symbols for the target sequence, and takes the “guess” as part of the input to update the 
state of the generating LSTM, while in a memory read-write, there is no information flow from higher 
layers to the controller; 

• since the target sequence in general has a different length from the top-layer memory, it takes only pure
C-addressing reading and relies on the built-in mechanism of the generating LSTM to stop (i.e., after
generating an End-of-Sentence token).

Memory of different layers could be equipped with different read-write strategies, and even for the same strategy, 
the configurations and learned parameters are in general different. This is in contrast to DNNs, for which the 
transformations of different layers are more homogeneous (mostly linear transforms with nonlinear activation 
function). A sensible architecture design in combining the nonlinear transformations can greatly affect the 
performance of the model, on which however little is known and future research is needed. 
Figure 5: Illustration of stacked layers of memory (left) and the overall diagram of DeepMemory (right).


3.1 Cross-Layer Reading 


In addition to the generic read-write strategies in Section 2.2, we also introduce cross-layer reading into
DeepMemory for more modeling flexibility. In other words, for writing in any Layer-ℓ, DeepMemory
allows reading from layers more than one level below ℓ, instead of just Layer-(ℓ−1). More specifically, we
consider the following two cases.


Memory-Bundle: A Memory-Bundle, as shown in Figure 6 (left panel), concatenates the units of two
aligned memories in reading, regardless of the addressing strategy. Formally, the $n$-th location in the bundle
of memory Layer-ℓ' and Layer-ℓ'' would be $x^{\ell'+\ell''}_n = [(x^{\ell'}_n)^\top, (x^{\ell''}_n)^\top]^\top$. Since it requires strict
alignment between the memories to be put together, Memory-Bundle is usually applied to layers created with
spatial structure of the same origin (see Section 3.3 for examples).
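The bundling operation amounts to a location-wise concatenation, sketched below with a hypothetical helper name. The explicit length check reflects the strict alignment requirement; raising an error on mismatch is a design choice of this sketch.

```python
def memory_bundle(*memories):
    """Concatenate aligned memories location by location."""
    if len({len(memory) for memory in memories}) != 1:
        raise ValueError("Memory-Bundle requires memories with the same number of locations")
    bundle = []
    for units in zip(*memories):       # the n-th unit of every memory
        merged = []
        for unit in units:
            merged.extend(unit)
        bundle.append(merged)
    return bundle
```

The bundle keeps the number of locations and sums the widths, so any addressing strategy that works on a single memory can read from it unchanged.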


[Figure 6 panels: Memory-Bundle (left) and Short-Cut (right), each drawn over Memory Layer-ℓ'', Memory Layer-ℓ' and Memory Layer-ℓ]

Figure 6: Cross-layer reading.


Short-Cut Unlike Memory-Bundle, Short-Cut allows reading from layers with potentially different
inner structures by using multiple read-heads, as shown in Figure 6 (right panel). For example, one can use
a C-addressing read-head on memory Layer-ℓ' and an L-addressing read-head on Layer-ℓ'' for the writing to
memory Layer-ℓ with ℓ' < ℓ.


3.2 Optimization 


For any designed architecture, the parameters to be optimized include $\{\Theta_d, \Theta_r, \Theta_w\}$ for each controller, the
parameters for the LSTM in the output layer, and the word embeddings. Since the reading from each memory
can only be done after the writing on it completes, the “feed-forward” process can be described at two scales: 1)
the flow from the memory of a lower layer to the memory of a higher layer, and 2) the forming of a memory at each layer
controlled by the corresponding state machine. Accordingly in optimization, the flow of the “correction signal”
also propagates at two scales:


• On the “cross-layer” scale: the signal starts with the output layer and propagates from higher layers to 
lower layers, until Layer-1 for the tuning of word embedding; 

• On the “within-layer” scale: the signal back-propagates through time (BPTT) controlled by the
corresponding state machine (LSTM). In optimization, there is a correction for each reading or writing
on each location in a memory, making C-addressing more expensive than L-addressing, since it in
general involves all locations in the memory at each time t.


The optimization can be done via standard back-propagation (BP), aiming to maximize the likelihood of the
target sequence. In practice, we use standard stochastic gradient descent (SGD) with mini-batches (size 80)
and with the learning rate controlled by AdaDelta (Zeiler, 2012).
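For readers unfamiliar with AdaDelta, the per-parameter update rule can be sketched as below. The decay rate and the small constant are assumed illustrative values, since the settings used in the experiments are not stated here, and the function name is hypothetical.

```python
import math

def adadelta_step(params, grads, avg_sq_grad, avg_sq_delta, rho=0.95, eps=1e-6):
    """Apply one in-place AdaDelta update to a flat list of parameters.

    rho and eps are assumed values chosen for illustration.
    """
    for i, g in enumerate(grads):
        # running average of squared gradients
        avg_sq_grad[i] = rho * avg_sq_grad[i] + (1 - rho) * g * g
        # step scaled by the ratio of the two running RMS values
        delta = -math.sqrt(avg_sq_delta[i] + eps) / math.sqrt(avg_sq_grad[i] + eps) * g
        # running average of squared updates
        avg_sq_delta[i] = rho * avg_sq_delta[i] + (1 - rho) * delta * delta
        params[i] += delta
```

In a training loop, `grads` would be the mini-batch gradient of the negative log-likelihood, and the two running averages are kept per parameter across steps.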
3.3 Architectural Variations of DeepMemory


We discuss four representative special cases of DeepMemory: Arc-I, II, III and IV, as novel deep
architectures for machine translation. We also show that current neural machine translation models like RNNsearch can
be described in the framework of DeepMemory as a relatively shallow case.


[Inline figure: Arc-I, drawn as two layer stacks, each running from Memory Layer-0 (symbols) through Memory Layer-1 (word embeddings), Memory Layer-2 and Memory Layer-3 to the Output Layer]


Arc-I The first proposal, including two variants (Arc-Ihyb and Arc-Iloc), is designed to demonstrate the effect
of C-addressing reading between intermediate memory layers, with its diagram shown in the figure to the right of the text.
Both variants employ an L-addressing reading from memory Layer-1 (the embedding layer) and L-addressing
writing to Layer-2. After that, Arc-Ihyb writes to Layer-3 (L-addressing) based on its H-addressing reading (two
read-heads) on Layer-2, while Arc-Iloc uses L-addressing to read from Layer-2. Once Layer-3 is formed, it is then put
together with Layer-2 as a Memory-Bundle, from which the output layer reads (C-addressing) for predicting
the target sequence. Memory-Bundle, with its empirical advantage over single layers (see Section 4.2),
is also used in the other three architectures for generating the target sequence or forming intermediate layers.

Arc-II As an architecture similar to Arc-Ihyb, Arc-II is designed to investigate
the effect of H-addressing reading from different layers of memory
(or Short-Cut in Section 3.1). It uses the same strategy as Arc-Ihyb in
generating memory Layer-1 and 2, but differs in generating Layer-3, where
Arc-II uses C-addressing reading on Layer-2 but L-addressing reading on
Layer-1. Once Layer-3 is formed, it is then put together with Layer-2 as a
Memory-Bundle, which is then read by the output layer for predicting the
target sequence.


[Inline figure: layer stack running from Memory Layer-0 (symbols) through Memory Layer-1 (word embeddings), Memory Layer-2 and Memory Layer-3 to the Output Layer]


Arc-III We intend to use this design to study a deeper architecture and a more
complicated addressing strategy. Arc-III follows the same way as Arc-II
to generate Layer-1, Layer-2 and Layer-3. After that it uses two read-heads
combined with an L-addressing write to generate Layer-4, where the two
read-heads consist of an L-addressing read-head on Layer-1 and a C-addressing
read-head on the memory bundle of Layer-2 and Layer-3. After the generation of
Layer-4, it puts Layer-2, 3 and 4 together into a bigger Memory-Bundle for
the output layer. Arc-III, with 4 intermediate layers, is the deepest among the
four special cases.
Arc-IV This proposal is designed to study the efficacy of C-addressing writing
in forming intermediate representations. It employs an L-addressing reading
from memory Layer-1 and L-addressing writing to Layer-2. After that, it uses
an L-addressing reading on Layer-2 to write to Layer-3 with C-addressing. For
C-addressing writing to Layer-3, all locations in Layer-3 are randomly initialized.
Once Layer-3 is formed, it is then bundled with Layer-2 for the reading
(C-addressing) of the output layer.

[Inline figure: Arc-IV, with Memory Layer-0 (symbols), Memory Layer-1 (word embeddings) and the Output Layer labeled]

3.3.1 Relation to other neural machine translators


As pointed out earlier, RNNsearch (Bahdanau et al., 2015), with its automatic alignment, is a special case
of DeepMemory with a shallow architecture. As pictorially illustrated in Figure 7, it employs L-addressing
reading on memory Layer-1 (the embedding layer), and L-addressing writing to Layer-2, which is then read
(C-addressing) by the output layer to generate the target sequence. As shown in Figure 7, Layer-2 is the only
intermediate layer created by nontrivial read-write operations.


On the other hand, the connection between DeepMemory and encoder-decoder architectures is less obvious
since they usually require reading from only the last cell (i.e., for a fixed-length vector representation)
between certain layers. More specifically, Sutskever et al. (2014) can be viewed as DeepMemory with stacked
layers of L-addressing read-write (described in Section 2.2) for both the encoder and the decoder part, while the
two are actually connected through the last hidden states of the LSTMs of the corresponding layers.
[Figure 7 panels: Attention-model diagram; DeepMemory diagram with Memory Layer-0 (symbols), Memory Layer-1 (word embeddings), Memory Layer-2 and the Output Layer]

Figure 7: RNNsearch as a special case of DeepMemory.


4 Experiments 

We report our empirical study on applying DeepMemory to Chinese-to-English translation. Our training data
consist of 1.25M sentence pairs extracted from LDC corpora, with 27.9M Chinese words and 34.5M English
words respectively. We choose the NIST 2002 (MT02) dataset as our development set, and the NIST 2003 (MT03),
2004 (MT04) and 2005 (MT05) datasets as our test sets. We use the case-insensitive 4-gram NIST BLEU score
as our evaluation metric, and the sign-test (Collins et al., 2005) as the statistical significance test. In training the
neural networks, we limit the source and target vocabularies to the most frequent 16K words in Chinese and
English, covering approximately 95.8% and 98.3% of the two corpora respectively.
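The coverage figures above are token-level: the share of running words in a corpus that fall inside the truncated vocabulary. A minimal sketch of this computation is given below, assuming the corpus is already word-segmented into a flat list of tokens; the function name is hypothetical.

```python
from collections import Counter

def vocabulary_coverage(tokens, vocab_size):
    """Return the most frequent words and the fraction of tokens they cover."""
    counts = Counter(tokens)
    kept = counts.most_common(vocab_size)
    covered = sum(count for _, count in kept)
    return {word for word, _ in kept}, covered / len(tokens)
```

Words outside the returned vocabulary would typically be mapped to a single unknown-word symbol before training.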

We compare our method with two state-of-the-art SMT and NMT³ models:


• Moses (Koehn et al., 2007): an open source phrase-based translation system with default configuration
and a 4-gram language model trained on the target portion of the training data with (Stolcke et al., 2002);

• RNNsearch (Bahdanau et al., 2015): an attention-based NMT model with default setting
(RNNsearch_DEFAULT), as well as an optimal re-scaling of the model (on sizes of both embedding and
hidden layers, with about 50% more parameters) (RNNsearch_BEST).

For a fair comparison, 1) the output layer in each DeepMemory variant is implemented with Gated Recurrent
Units (GRU) as in (Bahdanau et al., 2015), and 2) all the DeepMemory architectures are designed to have the
same embedding size as in RNNsearch_DEFAULT, with parameter size less than or comparable to RNNsearch_BEST.


4.1 Results 


The main results of different models are given in Table 1. RNNsearch (best) is about 1.7 points behind Moses
in BLEU on average, which is consistent with the observations made by other authors on different machine
translation tasks (Bahdanau et al., 2015; Jean et al., 2015). Remarkably, some sensible designs of DeepMemory
(e.g., Arc-II) can already achieve performance comparable to Moses, with only 42M parameters, while
RNNsearch_BEST has 46M parameters.


Clearly all DeepMemory architectures yield performance significantly better than (Arc-Ihyb, Arc-II &
Arc-III) or comparable (Arc-Iloc & Arc-IV) to the NMT baselines. Among them, Arc-II outperforms
the best setting of the NMT baseline (RNNsearch_BEST) by about 1.5 BLEU on average with fewer parameters.


Systems             MT03     MT04     MT05     Average   Parameters #
RNNsearch_DEFAULT   29.02    31.25    28.32    29.53     31M
RNNsearch_BEST      30.28    31.72    28.52    30.17     46M
Arc-Iloc            28.98    32.02    29.53*   30.18     54M
Arc-Ihyb            30.14    32.70*   29.40*   30.75     54M
Arc-II              31.27*   33.02*   30.63*   31.64     42M
Arc-III             30.15    33.46*   29.49*   31.03     53M
Arc-IV              29.88    32.00    28.76    30.21     48M
Moses               31.61    33.48    30.75    31.95     -

Table 1: BLEU-4 scores (%) of NMT baselines (RNNsearch_DEFAULT and RNNsearch_BEST), DeepMemory
architectures (Arc-I, II, III and IV), and the phrase-based SMT system (Moses). The * indicates that the results
are significantly (p<0.05) better than those of RNNsearch_BEST.
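The averages and gaps quoted in the text can be re-derived from the per-test-set scores of Table 1. The short check below assumes that the Average column is the unweighted mean of MT03, MT04 and MT05; the dictionary keys are labels chosen for this sketch.

```python
# Per-test-set BLEU scores (MT03, MT04, MT05) copied from Table 1.
scores = {
    "RNNsearch_DEFAULT": (29.02, 31.25, 28.32),
    "RNNsearch_BEST": (30.28, 31.72, 28.52),
    "Arc-Iloc": (28.98, 32.02, 29.53),
    "Arc-Ihyb": (30.14, 32.70, 29.40),
    "Arc-II": (31.27, 33.02, 30.63),
    "Arc-III": (30.15, 33.46, 29.49),
    "Arc-IV": (29.88, 32.00, 28.76),
    "Moses": (31.61, 33.48, 30.75),
}

# Unweighted mean over the three test sets, rounded as in the table.
averages = {name: round(sum(values) / len(values), 2) for name, values in scores.items()}

gap_arc2_over_best = averages["Arc-II"] - averages["RNNsearch_BEST"]
gap_moses_over_best = averages["Moses"] - averages["RNNsearch_BEST"]
gap_arc2_over_arc3 = averages["Arc-II"] - averages["Arc-III"]
```

Under this assumption the recomputed averages match the Average column, and the three gaps come out at roughly 1.47, 1.78 and 0.61 BLEU respectively.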

³There has been recent progress on aggregating multiple models or enlarging the vocabulary (e.g., in (Jean et al., 2015)), but
here we focus on the generic models.
4.2 Discussion

About Depth: A more detailed comparison between RNNsearch (two layers), Arc-II (three layers) and Arc-III
(four layers), both quantitative and qualitative, suggests that deep architectures are essential to the superior
performance of DeepMemory. Although the deepest architecture Arc-III is about 0.6 BLEU behind Arc-II, its
performance on long sentences is significantly better. Figure 8 shows the BLEU scores of generated translations
on the test sets with respect to the length of the source sentences. In particular, we test the BLEU scores on
sentences longer than {0, 10, 20, 30, 40, 50, 60} words in the merged test set of MT03, MT04 and MT05. Clearly, on
sentences with length > 30, Arc-III yields consistently higher BLEU scores than Arc-II. This
observation is further confirmed by our inspection of translation quality (see Appendix) and is
consistent with our intuition that DeepMemory, with its multiple layers of transformation, is
especially good at modeling the transformations of representations essential to machine translation on
relatively complicated sentences.
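The length-based evaluation uses cumulative subsets: each threshold selects all test sentences whose source side is longer than that threshold, so the subset for 0 is the whole merged test set. A minimal sketch of the selection step follows, assuming whitespace-separated (already segmented) source sentences; the function name is hypothetical and the BLEU computation on each subset is left to an external scorer.

```python
def subsets_longer_than(source_sentences, thresholds):
    """Map each threshold to the indices of sentences with more words than it."""
    lengths = [len(sentence.split()) for sentence in source_sentences]
    return {threshold: [i for i, n in enumerate(lengths) if n > threshold]
            for threshold in thresholds}
```

The returned indices can be used to pick the matching hypotheses and references before scoring each subset.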

About C-addressing Read: Further comparison between Arc-I_hyb and Arc-I_loc (similar parameter sizes)
suggests that C-addressing reading plays an important role in learning a powerful transformation between
intermediate representations, necessary for translation between language pairs with vastly different syntactic
structures. This conjecture is further supported by the good performance of Arc-II and Arc-III, both of which
have C-addressing read-heads in their intermediate memory layers. However, if memory Layer-(ℓ+1) is formed
with only a C-addressing read from memory Layer-ℓ, and serves as the only memory passed on to later stages,
the performance is usually less satisfying. Comparison of this design with H-addressing (results omitted here)
suggests that another read-head with L-addressing can prevent the transformation from going astray by adding
the tempo from a memory with a clearer temporal structure.
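
As a rough illustration of why pairing the two read-heads can help, the sketch below forms each cell of the next layer from both a location-based read and a content-based read. It is a sketch under stated assumptions rather than the implementation evaluated here: the dot-product scoring, the concatenation of the two reads, and all function names are hypothetical choices made for illustration, and no learned parameters are included.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def c_addressing_read(memory, query):
    """Content-based read: weighted sum of all cells.

    Assumption: the weight of each cell is the softmax of its dot product
    with the query; the actual scoring function may differ.
    """
    weights = softmax([sum(q * m for q, m in zip(query, cell)) for cell in memory])
    dim = len(memory[0])
    return [sum(w * cell[d] for w, cell in zip(weights, memory)) for d in range(dim)]


def l_addressing_read(memory, t):
    """Location-based read: return the cell at position t, keeping temporal order."""
    return list(memory[t])


def hybrid_read(memory, queries):
    """For each position t, concatenate the location-based and content-based reads."""
    return [l_addressing_read(memory, t) + c_addressing_read(memory, queries[t])
            for t in range(len(memory))]
```

In this toy version the location-based half of every output cell is tied to position t, so the output keeps the order of the lower memory even when the content-based weights are spread across the whole sequence. That is one way to picture the tempo contributed by the second read-head.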

About C-addressing Write: The BLEU scores of Arc-IV are lower than those of Arc-II but comparable
to those of RNNsearch_BEST, suggesting that writing with C-addressing alone yields a reasonable representation. A
closer look shows that although Arc-IV performs poorly on very long sentences (e.g., source sentences with
over 60 words), it does fairly well on sentences of normal length. More specifically, on source sentences
with length < 40, it outperforms RNNsearch_BEST by 0.79 BLEU points. One possible explanation is that our
particular implementation of C-addressing for writing in Arc-IV (Section 3.3) relies heavily on the randomly
initialized content and is hard to optimize, especially when the structure of the sentence is complex; in such
cases it might need to be “guided” by another write-head or some smart initialization.

About Cross-layer Read: As another observation, cross-layer reading almost always helps. The performance
of Arc-I, II, III and IV uniformly drops after removing the Memory-Bundle and Short-Cut
(results omitted here), even after broadening the memory units to keep the parameter size unchanged. This might
be due to the flexibility gained in mixing different addressing modes and representations of different stages.



Figure 8: The BLEU scores of generated translations on 
the merged three test sets with respect to the lengths of 
source sentences. The numbers on the x-axis of the figure
stand for sentences longer than the corresponding length, 
e.g., 30 for source sentences with > 30 words. 


5 Conclusion 

We propose DeepMemory, a novel architecture for sequence-to-sequence learning, which is inspired by
the recent work on the Neural Turing Machine (Graves et al., 2014) and Neural Machine Translation (Bahdanau
et al., 2015). DeepMemory builds its deep architecture for processing sequence data on the basis of a series of
transformations induced by the read-write operations on a stack of memories. This new architecture significantly 
improves the expressive power of models in sequence-to-sequence learning, which is verified by our empirical 
study on a benchmark machine translation task. 
APPENDIX: Actual Translation Examples

In the appendix we give some example translations from DeepMemory, more specifically from Arc-II and
Arc-III, and compare them against the reference and the translation given by RNNsearch. We focus on long
sentences with relatively complicated structures.

Example Translation of Arc-II


Source 

[Chinese source sentence; text lost in extraction]

Reference

yesterday opposition parliament members refused to vote on the constitutional amendment bill proposed by the government

RNNsearch

the members of the opposition parliament have refused to accept the decision on the amendment of the constitution

DeepMemory (Arc-II)

the opposition parliament members yesterday refused to vote on the constitutional amendment proposed by the government


Source 

[Chinese source sentence; text lost in extraction]

Reference

in his address, shevardnadze stressed the importance of the exchange of intelligence with russia and the united states on this matter.

RNNsearch

in his speech , UNK stressed the importance of cooperation and exchange between the two countries on this question .

DeepMemory (Arc-II)

in his speech , UNK stressed the importance of cooperation and exchange of intelligence between russia and the united states on this issue .


Source 

[Chinese source sentence; text lost in extraction]

Reference

bordyuzha noted that the creation of a fast reaction force by the cis collective security treaty organization would help contain the strength of extremist groups in central asia

RNNsearch

UNK pointed out that the establishment of the organization of the commonwealth of independent states ( cis ) is conducive to the maintenance of extremist organizations in the central asian region

DeepMemory (Arc-II)

UNK noted that the establishment of a rapid reaction force in the organization of collective security of the commonwealth of independent states is conducive to curbing the extremist forces in the central asian region


Source 

[Chinese source sentence; text lost in extraction]

Reference

england said the pentagon verified the identity of this prisoner one or two days ago and that the state department will notify the home country of the prisoner of this decision and make arrangements for his release.

RNNsearch

england said that the pentagon was established in a and two days ago that the state council would inform the state of the decision and arrangements for the release of the detainees .

DeepMemory (Arc-II)

england said that the pentagon had confirmed the identity of the prisoners before one or two days ago , and the state council would notify the countries involved in the decision and arrange the release of the prisoners .
Example Translation of Arc-III


Source 

[Chinese source sentence; text lost in extraction]

Reference

us deputy secretary of state armitage issued warnings to syria today, saying the development of future relations between damascus and washington depends on whether syria wants to cooperate with the united states on the issues of iraq and lebanon.

RNNsearch

us deputy secretary of state richard armitage issued a warning to syria today that the development of damascus , washington , and washington could see if syria is willing to cooperate with the united states in iraq and lebanon .

DeepMemory (Arc-III)

us deputy secretary of state richard armitage issued a warning today that the development of the damascus - washington relations in the future will depend on whether syria is willing to cooperate with the united states on iraq and lebanon issues .


Source 

[Chinese source sentence; text lost in extraction]

Reference

the russian state petroleum company is merging with the russian state gas company, and with the inclusion of yuganskneftegaz, the combined three will account for a fifth of russia's oil output and all its natural gas production, making it the largest global energy group.

RNNsearch

the russian state - run oil company is merging with the state - owned gas company , and the state - owned gas corp .

DeepMemory (Arc-III)

russia 's state - owned petroleum company is merging with the russian state - owned gas company , and the third country will have to control the one - fifth of russian oil production and to become the biggest energy group in the world .


Source 

[Chinese source sentence; text lost in extraction]

Reference

both greece and turkey are nato members, but the two have conflicts of interest and clashes over maritime and air territories and continental shelf boundary disputes in the aegean sea and the issue of cyprus, which have affected the normalization of their bilateral relations.

RNNsearch

the two countries have maintained conflicts and contradictions in the fields of territorial waters , airspace , and islands , as well as contradictions and conflicts .

DeepMemory (Arc-III)

greece and turkey are the member countries of the nato , but the conflict of interests and contradictions between the two countries on the demarcation of the territorial waters , airspace , and the islands of the islands also adversely affect the normalization of relations between the two countries .